Abstract

This paper describes a partial parser that as- 
signs syntactic structures to sequences of part- 
of-speech tags. The program uses the maximum 
entropy parameter estimation method, which al- 
lows a flexible combination of different know- 
ledge sources: the hierarchical structure, parts 
of speech and phrasal categories. In effect, the 
parser goes beyond simple bracketing and recog- 
nises even fairly complex structures. We give 
accuracy figures for different applications of the 
parser. 

1 Introduction 

The maximum entropy framework has proved 
to be a powerful modelling tool in many ar- 
eas of natural language processing. Its ap- 
plications range from sentence boundary dis- 
ambiguation (Reynar and Ratnaparkhi, 1997) 
to part-of-speech tagging (Ratnaparkhi, 1996), 
parsing (Ratnaparkhi, 1997) and machine trans- 
lation (Berger et al., 1996). 

In the present paper, we describe a partial 
parser based on the maximum entropy mod- 
elling method. After a synopsis of the maximum 
entropy framework in section 2, we present the 
motivation for our approach and the techniques 
it exploits (sections 3 and 4). Applications and 
results are the subject of sections 5 and 6.

2 Maximum Entropy Modelling 

The expressiveness and modelling power of the
maximum entropy approach arise from its ability
to combine information coming from different
knowledge sources. Given a set $X$ of possible
histories and a set $Y$ of futures, we can
characterise events from the joint event space
$X \times Y$ by defining a number of features, i.e.,
equivalence relations over $X \times Y$. By defining
these features, we express our insights about
information relevant to modelling.

In such a formalisation, the maximum en- 
tropy technique consists in finding a model that 
(a) fits the empirical expectations of the pre- 
defined features, and (b) does not assume any- 
thing specific about events that are not sub- 
ject to constraints imposed by the features. In 
other words, we search for the maximum en- 
tropy probability distribution p* : 



$$p^* = \arg\max_{p \in P} H(p)$$

where $P = \{p : p \text{ meets the empirical feature expectations}\}$
and $H(p)$ denotes the entropy of $p$.

For parameter estimation, we can use the Im- 
proved Iterative Scaling (IIS) algorithm (Berger 
et al., 1996), which assumes p to have the form: 



$$p(x, y) = \frac{1}{Z(x)} \cdot e^{\sum_i \lambda_i \cdot f_i(x, y)}$$

where $f_i : X \times Y \to \{0, 1\}$ is the indicator function
of the $i$-th feature, $\lambda_i$ the weight assigned
to this feature, and $Z(x)$ a normalisation constant.
IIS iteratively adjusts the weights ($\lambda_i$) of
the features; the model converges to the maximum
entropy distribution.
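The log-linear form above can be illustrated with a minimal Python sketch. The two indicator features, their weights and the set of futures are hypothetical values chosen for illustration; they are not taken from the parser, and the weights are simply given rather than estimated with IIS.

```python
import math

def maxent_prob(x, y, futures, features, weights):
    """Probability of future y given history x under the log-linear form:
    exp(sum_i weight_i * f_i(x, y)) / Z(x)."""
    def score(future):
        return math.exp(sum(w * f(x, future) for f, w in zip(features, weights)))
    z = sum(score(future) for future in futures)  # normalisation constant Z(x)
    return score(y) / z

# Hypothetical overlapping indicator features (illustration only).
features = [
    lambda x, y: 1 if x == "ART" and y == "NN" else 0,  # history-specific
    lambda x, y: 1 if y == "NN" else 0,                 # overlaps with the first
]
weights = [1.2, 0.4]  # assumed weights, not estimated from data
futures = ["NN", "ADJA", "VVFIN"]

probs = {y: maxent_prob("ART", y, futures, features, weights) for y in futures}
```

Both features fire for the pair (ART, NN), so their weights add up in the exponent; this is the sense in which overlapping features are unproblematic for the model.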

One of the most attractive properties of the 
maximum entropy approach is its ability to cope 
with feature decomposition and overlapping fea- 
tures. In the following sections, we will show 
how these advantages can be exploited for par- 
tial parsing, i.e., the recognition of syntactic 
structures of limited depth. 

3 Context Information for Parsing 

An interesting feature of many partial parsers 
is that they recognise phrase boundaries mainly 
on the basis of cues provided by strictly local
contexts. Regardless of whether or not abstractions
such as phrases occur in the model, most
of the relevant information is contained directly 
in the sequence of words and part-of-speech tags 
to be processed. 

An archetypal representative of this approach 
is the method described by Church (1988), who 
used corpus frequencies to determine the boundaries
of simple non-recursive NPs. For each pair
of part-of-speech tags $t_i$, $t_j$, the probability of an
NP boundary ('[' or ']') occurring between $t_i$ and
$t_j$ is computed. On the basis of these context
probabilities, the program inserts the symbols 
'[' and ']' into sequences of part-of-speech tags. 

Information about lexical contexts also sig- 
nificantly improves the performance of deep 
parsers. For instance, Joshi and Srinivas 
(1994) encode partial structures in the Tree Ad- 
joining Grammar framework and use tagging 
techniques to restrict a potentially very large
number of alternative structures. Here, the context
incorporates information about both the
terminal yield and the syntactic structure built 
so far. 

Local configurations of words and parts of 
speech are a particularly important knowledge 
source for lexicalised grammars. In the Link 
Grammar framework (Lafferty et al., 1992; 
Della Pietra et al., 1994), strictly local contexts
are naturally combined with long-distance in- 
formation coming from long-range trigrams. 

Since modelling syntactic context is a very 
knowledge-intensive problem, the maximum en- 
tropy framework seems to be a particularly ap- 
propriate approach. Ratnaparkhi (1997) intro- 
duces several contextual predicates which pro- 
vide rich information about the syntactic con- 
text of nodes in a tree (basically, the structure 
and category of nodes dominated by or dom- 
inating the current phrase). These predicates 
are used to guide the actions of a parser. 

The use of a rich set of contextual features is 
also the basic idea of the approach taken by Her- 
mjakob and Mooney (1997), who employ predi- 
cates capturing syntactic and semantic context 
in their parsing and machine translation system. 

4 A Partial Parser for German 

The basic idea underlying our approach to par- 
tial parsing can be characterised as follows: 

• An appropriate encoding format makes it
possible to express all relevant lexical, categorial
and structural information in a finite
alphabet of structural tags assigned to
words (section 4.1).

• Given a sequence of words tagged with 
part-of-speech labels, a Markov model is 
used to determine the most probable se- 
quence of structural tags (section 4.2). 

• Parameter estimation is based on the max- 
imum entropy technique, which takes full 
advantage of the multi-dimensional charac- 
ter of the structural tags (section 4.3). 

The details of the method employed are ex- 
plained in the remainder of this section. 

4.1 Relevant Contextual Information 

Three pieces of information associated with a
word $w_i$ are considered relevant to the parser:

• the part-of-speech tag $t_i$ assigned to $w_i$

• the structural relation $r_i$ between $w_i$ and
its predecessor $w_{i-1}$

• the syntactic category $c_i$ of $parent(w_i)$

On the basis of these three dimensions, structural
tags are defined as triples of the form
$S_i = (t_i, r_i, c_i)$. For better readability, we will
sometimes use attribute-value matrices to denote
such tags.



$$S_i = \begin{bmatrix} \text{TAG} & t_i \\ \text{REL} & r_i \\ \text{CAT} & c_i \end{bmatrix}$$



Since we consider structures of limited depth, 
only seven values of the REL attribute are dis- 
tinguished. 



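$$r_i = \begin{cases}
0 & \text{if } parent(w_i) = parent(w_{i-1}) \\
+ & \text{if } parent(w_i) = parent^2(w_{i-1}) \\
{+}{+} & \text{if } parent(w_i) = parent^3(w_{i-1}) \\
- & \text{if } parent^2(w_i) = parent(w_{i-1}) \\
{-}{-} & \text{if } parent^3(w_i) = parent(w_{i-1}) \\
= & \text{if } parent^2(w_i) = parent^2(w_{i-1}) \\
1 & \text{else}
\end{cases}$$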





If more than one of the conditions above is
met, the first of the corresponding tags in the
list is assigned. Figure 1 exemplifies the encoding
format.
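The encoding can be sketched in Python as follows. The tree representation (a dictionary of parent pointers) and the example PP bracketing are assumptions made for illustration; the sketch only covers words that have a predecessor and leaves the treatment of the first word open.

```python
def ancestor(parent, node, k):
    """Return the k-th ancestor of node (parent^k), or None if there is none."""
    for _ in range(k):
        if node is None:
            return None
        node = parent.get(node)
    return node

# Conditions in the order of the definition:
# (REL value, k applied to w_i, k applied to w_{i-1}).
REL_CONDITIONS = [
    ("0", 1, 1), ("+", 1, 2), ("++", 1, 3),
    ("-", 2, 1), ("--", 3, 1), ("=", 2, 2),
]

def rel_tag(parent, w_prev, w_cur):
    """REL value of w_cur relative to its predecessor w_prev."""
    for tag, k_cur, k_prev in REL_CONDITIONS:
        a = ancestor(parent, w_cur, k_cur)
        b = ancestor(parent, w_prev, k_prev)
        if a is not None and a == b:
            return tag  # the first matching condition wins
    return "1"

# Hypothetical tree: PP(bei, AP(etwa, NM(acht, Millionen)), Tonnen)
parent = {
    "bei": "PP", "etwa": "AP", "acht": "NM", "Millionen": "NM",
    "Tonnen": "PP", "NM": "AP", "AP": "PP", "PP": None,
}
words = ["bei", "etwa", "acht", "Millionen", "Tonnen"]
rels = [rel_tag(parent, words[i - 1], words[i]) for i in range(1, len(words))]
```

For this assumed tree, the two descents into the AP and the number phrase are encoded as `-`, the sibling pair as `0`, and the return to the PP level as `++`.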



Figure 1: Tags $r_2$ assigned to word $w_2$

These seven values of the $r_i$ attribute are
mostly sufficient to represent the structure of 
even fairly complex NPs, PPs and APs, involv- 
ing PP and genitive NP attachment as well as 
complex prenominal modifiers. The only NP 
components that are not treated here are rela- 
tive clauses and infinitival complements. A Ger- 
man prepositional phrase and its encoding are 
shown in figure 2. 



4.2 A Markovian Parser 

The task of the parser is to determine the best
sequence of triples $(t_i, r_i, c_i)$ for a given sequence
of part-of-speech tags $(t_0, t_1, \ldots, t_n)$. Since the
attributes TAG, REL and CAT can take only
a finite number of values, the number of such
triples will also be finite, and they can be used
to construct a second-order Markov model. The
triples $S_i = (t_i, r_i, c_i)$ are states of the model,
which emits POS tags ($t_i$) as signals.

In this respect, our approach does not differ much
from standard part-of-speech tagging
techniques. We simply assign the most probable
sequence of structural tags $S = (S_0, S_1, \ldots, S_n)$
to a sequence of part-of-speech tags
$T = (t_0, t_1, \ldots, t_n)$. Assuming the Markov property,
we obtain:

$$\begin{aligned}
\arg\max_S P(S \mid T) &= \arg\max_S P(S) \cdot P(T \mid S) \\
&= \arg\max_S \prod_i P(S_i \mid S_{i-2}, S_{i-1}) \, P(t_i \mid S_i) \qquad (1)
\end{aligned}$$

The part-of-speech tags are encoded in the
structural tag (the $t_i$ dimension), so $S$ uniquely
determines $T$. Therefore, we have $P(t_i \mid S_i) = 1$
if $S_i = (t_i, r_i, c_i)$ and $0$ otherwise, which simplifies
calculations.
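A minimal Python sketch of the search in equation (1) is given below. It is a plain Viterbi search over pairs of states, not the authors' implementation; the function names `candidates` and `trans` and the toy transition probabilities are hypothetical. Because the emission probability is 1 exactly for those states whose TAG equals the observed tag, the emission term is realised by restricting the candidate states.

```python
import math

def viterbi(pos_tags, candidates, trans):
    """Most probable sequence of structural tags for a POS tag sequence.

    candidates(t) returns the structural tags (t, r, c) whose TAG is t,
    i.e. the only states with a non-zero emission probability for t.
    trans(s2, s1, s) returns P(s | s2, s1); None marks a missing history.
    """
    # Hypotheses are keyed by their last two states (second-order model).
    best = {(None, None): (0.0, [])}
    for t in pos_tags:
        new_best = {}
        for (s2, s1), (logp, path) in best.items():
            for s in candidates(t):
                p = trans(s2, s1, s)
                if p <= 0.0:
                    continue  # impossible transition
                cand = (logp + math.log(p), path + [s])
                key = (s1, s)
                if key not in new_best or cand[0] > new_best[key][0]:
                    new_best[key] = cand
        best = new_best
    # Raises ValueError if no sequence has non-zero probability.
    return max(best.values(), key=lambda item: item[0])[1]

# Toy model (assumed values): three REL values, a single category.
def candidates(t):
    return [(t, r, "NP") for r in ("0", "-", "1")]

def trans(s2, s1, s):
    return 0.8 if s[1] == "0" else 0.1

best_path = viterbi(["ART", "ADJA", "NN"], candidates, trans)
```

In a real model, `trans` would return the contextual probabilities whose estimation is discussed in section 4.3.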
4.3 Parameter Estimation

The more interesting aspect of our parser is the
estimation of contextual probabilities, i.e., calculating
the probability of a structural tag $S_i$
(the "future") conditional on its immediate predecessors
$S_{i-1}$ and $S_{i-2}$ (the "history").



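$$\underbrace{\begin{bmatrix} \text{CAT: } c_{i-2} \\ \text{REL: } r_{i-2} \\ \text{TAG: } t_{i-2} \end{bmatrix} \begin{bmatrix} \text{CAT: } c_{i-1} \\ \text{REL: } r_{i-1} \\ \text{TAG: } t_{i-1} \end{bmatrix}}_{\text{history}} \qquad \underbrace{\begin{bmatrix} \text{CAT: } c_i \\ \text{REL: } r_i \\ \text{TAG: } t_i \end{bmatrix}}_{\text{future}}$$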





Figure 2: A sample structure. The labels are 
explained in Appendix B. 



In the following two subsections, we contrast 
the traditional HMM estimation method and 
the maximum entropy approach. 



4.3.1 Linear Interpolation 

One possible way of parameter estimation is to
use standard HMM techniques while treating
the triples $S_i = (t_i, r_i, c_i)$ as atoms. Trigram
probabilities are estimated from an annotated
corpus by using relative frequencies $r$:

$$r(S_i \mid S_{i-2}, S_{i-1}) = \frac{f(S_{i-2}, S_{i-1}, S_i)}{f(S_{i-2}, S_{i-1})}$$



A standard method of handling sparse data is to
use a linear combination of unigrams, bigrams,
and trigrams $p$:

$$p(S_i \mid S_{i-2}, S_{i-1}) = \lambda_1 r(S_i) + \lambda_2 r(S_i \mid S_{i-1}) + \lambda_3 r(S_i \mid S_{i-2}, S_{i-1})$$

The $\lambda_i$ denote weights for different context sizes
and sum up to 1. They are commonly estimated
by deleted interpolation (Brown et al., 1992).
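The interpolation scheme can be sketched in Python as follows. The toy sequences and the interpolation weights are assumptions; in particular, the weights are fixed by hand here rather than estimated by deleted interpolation.

```python
from collections import Counter

def train_counts(sequences):
    """Collect unigram, bigram and trigram frequencies of atomic tags."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for seq in sequences:
        for i, s in enumerate(seq):
            uni[s] += 1
            if i >= 1:
                bi[(seq[i - 1], s)] += 1
            if i >= 2:
                tri[(seq[i - 2], seq[i - 1], s)] += 1
    return uni, bi, tri

def interpolated(s2, s1, s, counts, lambdas):
    """Linear combination of unigram, bigram and trigram relative frequencies."""
    uni, bi, tri = counts
    l1, l2, l3 = lambdas  # assumed to sum to 1
    total = sum(uni.values())
    r1 = uni[s] / total if total else 0.0
    r2 = bi[(s1, s)] / uni[s1] if uni[s1] else 0.0
    r3 = tri[(s2, s1, s)] / bi[(s2, s1)] if bi[(s2, s1)] else 0.0
    return l1 * r1 + l2 * r2 + l3 * r3

# Toy data: each letter stands for one atomic structural tag.
counts = train_counts([["A", "B", "C"], ["A", "B", "D"]])
lambdas = (0.1, 0.3, 0.6)  # hand-picked for illustration

p_seen = interpolated("A", "B", "C", counts, lambdas)
p_unseen_history = interpolated("X", "B", "C", counts, lambdas)
```

An unseen trigram history still receives probability mass from the bigram and unigram terms, which is the purpose of the smoothing.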

4.3.2 Features 

A disadvantage of the traditional method is that
it considers only full n-grams $S_{i-n+1}, \ldots, S_i$ and
ignores a lot of contextual information, such
as regular behaviour of the single attributes
TAG, REL and CAT. The maximum entropy
approach offers an attractive alternative in this
respect since we are now free to define features
accessing different constellations of the attributes.
For instance, we can abstract over one
or more dimensions, as in the context description
in table 1.



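$$\underbrace{\begin{bmatrix} \text{TAG: ART} \end{bmatrix} \begin{bmatrix} \text{TAG: ADJA} \\ \text{REL: } 0 \end{bmatrix}}_{\text{history}} \qquad \underbrace{\begin{bmatrix} \text{CAT: NP} \\ \text{REL: } 0 \\ \text{TAG: NN} \end{bmatrix}}_{\text{future}}$$

Table 1: A partial trigram feature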

Such "partial n-grams" permit a better exploitation
of information coming from contexts
observed in the training data. We
say that a feature $f_k$ defined by the triple
$(M_{i-2}, M_{i-1}, M_i)$ of attribute-value matrices is
active on a trigram context $(S'_{i-2}, S'_{i-1}, S'_i)$ (i.e.,
$f_k(S'_{i-2}, S'_{i-1}, S'_i) = 1$) iff $M_j$ unifies with the
attribute-value matrix $M'_j$ encoding the information
contained in $S'_j$ for $j = i-2, i-1, i$. A
novel context would on average activate more
features than in the standard HMM approach,
which treats the $(t_i, r_i, c_i)$ triples as atoms.
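The activation condition can be sketched in Python, with attribute-value matrices represented as dictionaries. Since the matrices are flat and have atomic values, unification with a fully specified tag reduces to checking that every attribute mentioned in the partial matrix agrees. The concrete REL values in the example pattern are assumed for the sake of illustration.

```python
def unifies(partial, full):
    """A partial AVM unifies with a full tag if all its attributes agree."""
    return all(full.get(attr) == value for attr, value in partial.items())

def feature_active(pattern, context):
    """Return 1 if each matrix of the pattern unifies with the
    corresponding structural tag of the trigram context, else 0."""
    return int(all(unifies(m, s) for m, s in zip(pattern, context)))

# A partial trigram feature in the spirit of table 1 (REL values assumed).
pattern = (
    {"TAG": "ART"},
    {"TAG": "ADJA", "REL": "0"},
    {"CAT": "NP", "REL": "0", "TAG": "NN"},
)

def tag(t, r, c):
    return {"TAG": t, "REL": r, "CAT": c}

seen = (tag("ART", "1", "NP"), tag("ADJA", "0", "NP"), tag("NN", "0", "NP"))
novel = (tag("ART", "-", "PP"), tag("ADJA", "0", "PP"), tag("NN", "0", "NP"))
other = (tag("ART", "1", "NP"), tag("ADJA", "0", "NP"), tag("NE", "0", "NP"))

active = [feature_active(pattern, ctx) for ctx in (seen, novel, other)]
```

The context `novel` differs from `seen` as a full trigram, yet it still activates the feature because the pattern abstracts over the attributes in which the two contexts differ.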

The actual features are extracted from the 
training corpus in the following way: we first de- 
fine a number of feature patterns that say which 
attributes of a trigram context are relevant. All 
feature pattern instantiations that occur in the 
training corpus are stored; this procedure yields 
several thousand features for each pattern.
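The extraction procedure can be sketched in Python as follows. The representation of a feature pattern as a triple of attribute-name tuples, the two example patterns and the toy corpus are assumptions; sentence-initial contexts with fewer than two predecessors are ignored for brevity.

```python
def instantiate(pattern, context):
    """Project each tag of a trigram context onto the attributes that the
    pattern marks as relevant for that position."""
    return tuple(
        tuple(sorted((attr, tag[attr]) for attr in attrs))
        for attrs, tag in zip(pattern, context)
    )

def extract_features(patterns, corpus):
    """Store every pattern instantiation that occurs in the corpus."""
    features = set()
    for seq in corpus:
        for i in range(2, len(seq)):
            context = seq[i - 2:i + 1]
            for pattern_id, pattern in enumerate(patterns):
                features.add((pattern_id, instantiate(pattern, context)))
    return features

# Two hypothetical patterns: the future's TAG alone, and a trigram that
# uses only TAG in the history but all three attributes of the future.
patterns = [
    ((), (), ("TAG",)),
    (("TAG",), ("TAG",), ("TAG", "REL", "CAT")),
]

def tag(t):
    return {"TAG": t, "REL": "0", "CAT": "NP"}

corpus = [[tag(t) for t in ("ART", "ADJA", "NN", "ART", "ADJA", "NN")]]
features = extract_features(patterns, corpus)
```

Each stored instantiation corresponds to one indicator feature of the model; repeated contexts, such as the second occurrence of ART ADJA NN above, do not create new features.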

After computing the weights $\lambda_i$ of the features
occurring in the training sample, we can
calculate the contextual probability of a multi-dimensional
structural tag $S_i$ following the two
tags $S_{i-2}$ and $S_{i-1}$:

$$p(S_i \mid S_{i-2}, S_{i-1}) = \frac{1}{Z(x)} \cdot e^{\sum_i \lambda_i \cdot f_i(S_{i-2}, S_{i-1}, S_i)}$$



We achieved the best results with 22 empir- 
ically determined feature patterns comprising 
full and partial n-grams, $n \le 3$. These patterns
are listed in Appendix A. 

5 Applications 

Below, we discuss two applications of our max- 
imum entropy parser: treebank annotation and 
chunk parsing of unrestricted text. For precise 
results, see section 6. 

5.1 Treebank Annotation 

The partial parser described here is used for cor- 
pus annotation in a treebank project, cf. (Skut 
et al., 1997). The annotation process is more in- 
teractive than in the Penn Treebank approach 
(Marcus et al., 1994), where a sentence is first 
preprocessed by a partial parser and then edited 
by a human annotator. In our method, man- 
ual and automatic annotation steps are closely 
interleaved. Figure 3 exemplifies the human- 
computer interaction during annotation. 

The annotations encode four kinds of linguis- 
tic information: 1) parts of speech and inflec- 
tion, 2) structure, 3) phrasal categories (node 
labels), 4) grammatical functions (edge labels). 

Part-of-speech tags are assigned in a prepro- 
cessing step. The automatic instantiation of la- 
bels is integrated into the assignment of struc- 
tures. The annotator marks the words and 
phrases to be grouped into a new substructure, 
and the node and edge labels are inserted by the 
program, cf. (Brants et al., 1997). 




Das Volumen lag in besseren Zeiten bei etwa acht Millionen Tonnen

ART NN VVFIN APPR ADJA NN APPR ADV CARD NN NN $.

Def.Neut.Nom.Sg Neut.Nom.Sg.* 3.Sg.Past.Ind Dat Comp.*.Dat.Pl.St Fem.Dat.Pl.* Dat — — Fem.Dat.Pl.* Fem.Nom.Pl.*



Figure 3: A chunked sentence (in better times, the volume was around eight million tons). Gram- 
matical function labels: NK nominal kernel component, AC adposition, NMC number component, 
MO modifier. 



Initially, such annotation increments were 
just local trees of depth one. In this mode, the 
annotation of the PP bei etwa acht Millionen 
Tonnen ([at] around eight million tons) involves 
three annotation steps (first the number phrase 
acht Millionen, then the AP, and the PP). Each 
time, the annotator highlights the immediate 
constituents of the phrase being constructed. 

The use of the partial parser described in this 
paper makes it possible to construct the whole 
PP in only one step: The annotator marks the 
words dominated by the PP node, and the inter- 
nal structure of the new phrase is assigned auto- 
matically. This significantly reduces the amount 
of manual annotation work. The method yields 
reliable results in the case of phrases that ex- 
hibit a fairly rigid internal structure. More than 
88% of all NPs, PPs and APs are assigned the 
correct structure, including PP attachment and 
complex prenominal modifiers. 

Further examples of structures recognised by 
the parser are shown in figure 4. A more de- 
tailed description of the annotation mode can 
be found in (Brants and Skut, 1998). 

5.2 NP Chunker 

Apart from treebank annotation, our partial 
parser can be used to chunk part-of-speech 
tagged text into major phrases. Unlike in the 
previous application, the tool now has to deter- 
mine not only the internal structure, but also 
the external boundaries of phrases. This makes
the task more difficult, especially for determining
PP attachment.

However, if we restrict the coverage of the
parser to the prenominal part of the NP/PP, it
performs quite well, correctly assigning almost
95% of all structural tags, which corresponds to
a bracketing precision of ca. 87%.

6 Results 

In this section, we report the results of a cross- 
validation of the parser carried out on the Ne- 
Gra Treebank (Skut et al., 1997). The corpus 
was converted into structural tags and parti- 
tioned into a training and a testing part (90% 
and 10%, respectively). We repeated this proce- 
dure ten times with different partitionings; the 
results of these test runs were averaged. 
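The evaluation protocol can be sketched in Python as below. Whether the ten partitionings were drawn at random or as disjoint folds is not stated, so random 90/10 splits are an assumption here, as are the function names and the `train_and_evaluate` callback.

```python
import random

def random_partitions(sentences, runs=10, test_fraction=0.1, seed=0):
    """Yield (training, testing) splits of the corpus, one per test run."""
    rng = random.Random(seed)
    n_test = max(1, round(len(sentences) * test_fraction))
    for _ in range(runs):
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]

def averaged_score(sentences, train_and_evaluate, **kwargs):
    """Average the score returned by train_and_evaluate over all runs."""
    scores = [
        train_and_evaluate(train, test)
        for train, test in random_partitions(sentences, **kwargs)
    ]
    return sum(scores) / len(scores)

# Stand-in corpus of 100 sentence identifiers (illustration only).
splits = list(random_partitions(list(range(100))))
```

A caller would pass a function that trains the parser on the training part and returns one of the accuracy measures on the testing part.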

The weights of the features used by the 
maximum entropy parser were determined with 
the help of the Maximum Entropy Modelling 
Toolkit, cf. (Ristad, 1996). The number of fea- 
tures reached 120,000 for the full training cor- 
pus (12,000 sentences). Interestingly, tagging 
accuracy decreased after 4-5 iterations of
Improved Iterative Scaling, so only 3 iterations 
were carried out in each of the test runs. 

The accuracy measures employed are defined
as follows.

tags: the percentage of structural tags with the
correct value $r_i$ of the REL attribute,

bracketing: the percentage of correctly recog- 
nised nodes, 

labelled bracketing: like bracketing, but in- 
cluding the syntactic category of the nodes, 

structural match: the percentage of correctly 
recognised tree structures (top-level chunks 
only, labelling is ignored). 
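The bracketing measures can be sketched in Python as follows. Representing a node as a (start, end) span, or as a (start, end, category) triple for labelled bracketing, and counting a node as correct when it appears in both the reference and the parser output are assumptions of this sketch; the example spans are invented.

```python
def precision_recall(gold, predicted):
    """Precision and recall of predicted nodes against reference nodes.
    Both arguments are sets of hashable node descriptions."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Unlabelled bracketing: nodes are (start, end) word spans.
gold = {(0, 5), (1, 4), (2, 4)}
predicted = {(0, 5), (2, 4), (1, 3), (3, 5)}
bracketing = precision_recall(gold, predicted)

# Labelled bracketing: the syntactic category becomes part of the node.
gold_labelled = {(0, 5, "PP"), (1, 4, "AP"), (2, 4, "NM")}
predicted_labelled = {(0, 5, "PP"), (2, 4, "NP")}
labelled = precision_recall(gold_labelled, predicted_labelled)
```

Under this representation a labelled score can never exceed the corresponding unlabelled one, since a wrong category invalidates an otherwise correct span.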




Ein geradezu pathetischer Aufruf zum gemeinsamen Kampf für einen gerechten Frieden

ART ADV ADJA NN APPRART ADJA NN APPR ART ADJA NN 

An almost pathetic call for a joint fight for a just peace 




Eine Kostprobe aus einem für Oktober geplanten Programm mit der experimentellen Verbindung zwischen Rock- und Chormusik

ART NN APPR ART APPR NN ADJA NN APPR ART ADJA NN APPR TRUNC KON NN 

A sample of a for October planned program with the experimental link between rock and choir music 

über die im Nachtragshaushalt vorgesehene Stellenverteilung in der Verwaltung

APPR ART APPRART NN ADJA NN APPR ART NN

about the in the additional budget planned allocation of jobs in the administration

Figure 4: Examples of complex NPs and PPs correctly recognised by the parser. In the treebank 
application, such phrases are part of larger structures. The external boundaries (the first and 
the last word of the examples) are highlighted by an annotator; the parser recognises the internal
boundaries and assigns labels. 



6.1 Treebank Application 

In the treebank application, information about 
the external boundaries of a phrase is supplied 
by an annotator. To imitate this situation, we 
extracted from the NeGra corpus all sequences 
of part-of-speech tags spanned by NPs, PPs,
APs and complex adverbials. Other tags were 
left out since they do not appear in chunks 
recognised by the parser. Thus, the sentence
shown in figure 3 contributed three substrings
to the chunk corpus: ART NN, APPR ADJA NN 
and APPR ADV CARD NN NN, which would also 
be typical annotator input. A designated sepa- 
rator character was used to mark chunk bound- 
aries. 

Table 2 shows the performance of the parser 
on the chunk corpus. 



Table 2: Recall and precision results for the interactive
annotation mode.

measure          total    correct   recall   prec.
tags             129822   123435    95.1%
bracketing        56715    49715    87.7%    89.1%
lab. brack.       56715    47415    83.6%    84.8%
struct. match     37942    33450    88.2%    88.0%
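As a quick arithmetic check, the recall column of table 2 can be recomputed from the total and correct counts. The sketch assumes that "total" counts the items in the reference annotation; the precision column cannot be recomputed in the same way because the number of items proposed by the parser is not listed.

```python
# (total, correct) counts as given in table 2.
rows = {
    "tags": (129822, 123435),
    "bracketing": (56715, 49715),
    "lab. brack.": (56715, 47415),
    "struct. match": (37942, 33450),
}

# Recall in percent, rounded to one decimal place as in the table.
recall = {
    name: round(100 * correct / total, 1)
    for name, (total, correct) in rows.items()
}
```

The recomputed values agree with the recall figures reported in the table.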



6.2 Chunking Application 

Table 3 shows precision and recall for the chunk- 
ing application, i.e., the recognition of kernel 
NPs and PPs in part-of-speech tagged text. 
Post-nominal PP attachment is ignored. Un- 
like in the treebank application, there is no pre- 
editing by a human expert. The absolute num- 
bers differ from those in table 2 because cer- 
tain structures are ignored. The total number 
of structural tags is higher since we now parse 
whole sentences rather than separate chunks.

In addition to the four accuracy measures 
defined above, we also give the percentage 
of chunks with correctly recognised external 
boundaries (irrespective of whether or not there 
are errors concerning their internal structure). 

Table 3: Recall and precision for the chunking
application. The parser recognises only the
prenominal part of the NP/PP (without focus
adverbs such as also, only, etc.).

measure          total    correct   recall   prec.
tags             166995   158541    94.9%
bracketing        51912    45241    87.2%    86.9%
lab. brack.       51912    43813    84.4%    84.2%
struct. match     46599    41422    88.9%    87.6%
ext. bounds       46599    43833    94.1%    93.4%



6.3 Comparison to a Standard Tagger 

In the following, we compare the performance of 
the maximum-entropy parser with the precision 
of a standard HMM-based approach trained on 
the same data, but using only the frequencies 
of complete trigrams, bigrams and unigrams, 
whose probabilities are smoothed by linear in- 
terpolation, as described in section 4.3.1. 

Figure 5 shows the percentage of correctly assigned
values $r_i$ of the REL attribute depending
on the size of the training corpus. Generally, the
maximum entropy approach outperforms the
linear interpolation technique by about 0.5% -
1.5%, which corresponds to a 1% - 3% difference
in structural match. The difference decreases as
the size of the training sample grows. For the
full corpus consisting of 12,000 sentences, the
linear interpolation tagger is still inferior to the
maximum entropy one, but the difference in precision
becomes insignificant (0.2%). Thus, the
maximum entropy technique seems to be particularly
advantageous in the case of sparse data.

7 Conclusion 

We have demonstrated a partial parser capa- 
ble of recognising simple and complex NPs, 
PPs and APs in unrestricted German text. 
The maximum entropy parameter estimation 
method allows us to make optimal use of the context
information contained in the training sample.
On the other hand, the parser can still be
viewed as a Markov model, which guarantees
high efficiency (processing in linear time). The
program can be trained even with a relatively 
small amount of treebank data; then it can be 
used for parsing unrestricted pre-tagged text. 

As far as coverage is concerned, our parser 
can handle recursive structures, which is an ad- 
vantage compared to simpler techniques such 
as that described by Church (1988). On the 
other hand, the Markov assumption underlying 
our approach means that only strictly local de- 
pendencies are recognised. For full parsing, one 
would probably need non-local contextual infor- 
mation, such as the long-range trigrams in Link 
Grammar (Della Pietra et al., 1994).

Our future research will focus on exploiting 
morphological and lexical knowledge for partial 
parsing. Lexical context is particularly relevant 
for the recognition of genitive NP and PP at- 
tachment, as well as complex proper names. We 
hope that our approach will benefit from re- 
lated work on this subject, cf. (Ratnaparkhi 
et al., 1994). Further precision gain can also 
be achieved by enriching the structural context, 
e.g. with information about the category of the 
grandparent node. 
Appendix A: Feature Patterns

Below, we give the 22 n-gram feature patterns 
used in our experiments. 



history 



.0) 



bC 



r,t,c 

r sibl ,t,C 

r, c 
t 

j,Slbl £ 

r,i 
r, c 
r 

r,t,c 
r,t,c 
t 



r,t,c 
r, c 

r, t 
r,t,c 
r,t,c 
r, t, c 
r,t,c 
r,t,c 
r,t,c 



a 

s-i 
bC 

s 



r,t,c 
r,t,c 
r,t,c 
The symbols r (REL), t (TAG), and c (CAT)
indicate which attributes are taken into account
when generating a feature according to a particular
pattern. $r^{sibl}$ is a binary-valued attribute
saying whether the word under consideration
and its immediate predecessor are siblings (i.e.,
whether or not $r = 0$).

Appendix B: Tagsets 

This section contains descriptions of tags used 
in this paper. These are not complete lists. 

B.1 Part-of-Speech Tags

We use the Stuttgart-Tübingen-Tagset. The
complete set is described in (Thielen and
Schiller, 1995).



ADJA      attributive adjective
ADV       adverb
APPR      preposition
APPRART   preposition with determiner
ART       article
CARD      cardinal number
KON       conjunction
NE        proper noun
NN        common noun
PROAV     pronominal adverb
TRUNC     first part of truncated noun
VAFIN     finite auxiliary
VAINF     infinite auxiliary
VMFIN     finite modal verb
VVFIN     finite verb
VVPP      past participle of main verb



B.2 Phrasal Categories 

AP adjective phrase 

MPN multi-word proper noun

NM multi-token numeral

NP noun phrase 

PP prepositional phrase 

S sentence 

VP verb phrase 

B.3 Grammatical Functions 



AC    adpositional case marker
HD    head
MO    modifier
MNR   post-nominal modifier
NG    negation
NK    noun kernel
NMC   numerical component
OA    accusative object
OC    clausal object
PNC   proper noun component
SB    subject






